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Abstract 



Many existing or potential programming tools require the program to be 
completely recompiled with a special compiler option. This is usually incon- 
venient for the program developer, and may reduce the usefulness of the tool 
or the frequency with which the tool is employed. It may also require the 
maintenance of different versions of standard libraries, each compiled with 
the appropriate options for a different tool. 

The difference between modules compiled with and without the special op- 
tion is often simple and regular. If so, we can effect this difference by 
modifying the normally-compiled object code at link time, instead of recom- 
piling. This reduces the overhead of using the tool by an order of magnitude, 
making it much more convenient. 



1. Introduction 

Recompiling an entire multi-module program from scratch is usually so expensive 
that one does it only reluctantly. In spite of this, many useful tools for program 
optimization or performance analysis require the recompilation of the entire program. 
The recompilation is done with a compiler option specifying that the resulting program 
will be used in connection with that tool. 

A common example of this is the gprof [7] profiler. A gprof profile tells, among 
other things, how many times and by whom each procedure was called. The normal 
way to use gprof is to recompile all of your source modules with the compiler option 
-pg and then link the new object modules with standard libraries that have themselves 
previously been compiled with the same option. This results in an executable program 
augmented with instrumentation code to keep track of the calls that occur. Running 
this instrumented program produces a file of profile data, which the gprof program 
uses to produce a formatted, readable profile. 

The requirement that you recompile your source modules is a stringent one. Glo- 
bal recompilation may take one or two orders of magnitude longer than the usual 
development cycle, and when you have obtained your profile you must then recompile 
everything once more in order to remove the instrumentation. On top of that, the 
library of instrumented library routines must be maintained as well. This makes sys- 
tem administration more complicated, and requires disk space for an instrumented ver- 
sion of every standard library. More subtly, the need for an instrumented library 
makes it more awkward to allow several fundamentally different kinds of instrumenta- 
tion. 

It turns out that the effect of compiling with -pg is quite small, hardly worth the 
expense of recompilation. An alternative is to modify the object code itself. If we 
build a step into the linker to convert uninstrumented object modules into instrumented 
ones, we can avoid the expense of recompiling altogether. Since the linker extracts the 
necessary library modules from the library, we can do the same transformation on 
these, eliminating the need for a separate instrumented library. 

We first developed this capability to enable certain very global optimization. As 
part of the Mahler code generator for DECWRL's Titan, we built an interprocedural 
register allocator and pipeline instruction scheduler, which have been described else- 
where [20,21]. These optimizations required us to develop a technology of link-time 
code modification. Although we did not realize it immediately, it was an easy step to 
use the same machinery, usually in a simpler form, for performance instrumentation as 
well. 
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This paper is divided into two major sections. The first describes the technique 
of link-time code modification, in more detail than previous reports have done. The 
second describes a variety of different applications we have explored using this tech- 
nique. 

2. Related work 

The notion of modifying object code is not fundamentally new. Peephole optimi- 
zation is the process of improving generated code by making local transformations of 
it, and has been around for decades [15]. More recently, attempts have been made to 
characterize some or all of the normal optimization process as "peephole" optimiza- 
tion, by generating naive code and then transforming it into better code [5,6]. These 
techniques differ from ours in two ways. First, they have been used only for the pur- 
pose of optimization and not for performance analysis. More important, these tech- 
niques have been applied as the last part of compilation rather than as the first part of 
linking. This means that the compiler was free to keep the code in an internal or 
incomplete form, more convenient for transformation, instead of producing standard 
object modules that could be linked normally. 

At the other extreme is the pixie tool developed at MIPS Computer Systems [16]. 
This tool transforms a fully-linked executable program into an equivalent instrumented 
executable. Thus it transforms the code after linking instead of before. Pixie was 
developed independently from our system but the possible modifications overlap with 
ours; the motivation for their system was quite similar to ours. 

The pixie system works by transforming a finished executable instead of by 
transforming object files being linked. This has two consequences. First, their system 
is easier to use than ours. In our system the user must know how to perform the link 
step, so that the appropriate code modification option can be requested. On the other 
hand, our approach is easier to implement. An isolated object file, prior to linking, 
contains the relocation dictionary and loader symbol table, which make the transforma- 
tion easier. For example, pixie must postpone some address translation until the 
modified program is executed, because not enough is known about the meanings of 
data items when pixie is performing the transformation [14]. In contrast, our system 
does all address translation as part of code modification. Because our approach is 
easier to implement, we have (I believe) used it in a wider variety of ways. No doubt 
any of our applications could be done within the framework of pixie, but the fact is 
that many of them have not. 

3. Modifying code 

Our system works as follows. The compiler always compiles code the same way 
(neglecting options that are irrelevant to this paper). The linker reads the object 
modules into memory one at a time, from object files or from libraries. If the user 
asks the linker for a transformation, then each module is modified in memory; no 
modified object file is written. The module rewriter is essentially independent of the 
linker proper. It accepts a module that the linker proper has read from a file, and it 
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returns a changed module that could just as well have been read from a file. The 
linker combines the object modules, modified or not, into a single executable image, 
which it writes to an executable file. 

The linker has two obstacles it must overcome for our system to work. First, it 
must determine what transformations must be applied. These transformations might 
include changing, adding, or deleting instructions at various points in the module. 
Second, it must actually make the transformations, and convert the object module into 
an equivalent module that is internally consistent, so that it can then be linked 
correctly. 

The problem of determining the changes to make can be easy or hard, depending 
on the transformation that is desired. For some applications, the compiler must mark 
certain points in the code so that the linker will know what to do there. Such a mark 
can be made by including a loader symbol whose value is the address marked. The 
compiler routinely includes all these marks, for all different kinds of instrumentation. 
A mark is irrelevant unless the user asks the linker for a transformation with which the 
mark is associated. Other important points can be determined from information that is 
in the normal object format. For example, the linker can determine the boundaries 
between basic blocks by looking for all text addresses in the loader symbol table and 
by looking at the destinations of all pc-relative branches. 

This leaves the question of making the modifications correctly. 

Changing an instruction is easy, and in fact all linkers do that routinely. Reloca- 
tion is the process of filling in part of an instruction with address or offset information 
that is not known until link time. Normally this is done with the aid of a relocation 
dictionary that contains the addresses of instructions and data items that must be relo- 
cated, together with a reference to the loader symbol whose value determines the relo- 
cation to be done. The reason it is easy is that it is essentially context-independent. 
Once we know what change to make to the instruction, it can be made without 
affecting any other instructions. 

Adding or deleting instructions is somewhat harder. When instructions are added, 
others must slide down to make room for them. When instructions are deleted, others 
must slide up to fill in the gap. Sliding an instruction up or down changes its address, 
and we must correct for this change if the module is to remain self-consistent. To do 
this, we must know all the places in code or data where the address appears, explicitly 
or implicitly, and we must modify these places so that the correct new address appears 
instead. 

This turns out to be easy to do, because most such places are marked for reloca- 
tion. For example, if the address of a procedure appears literally in some instruction 
or data word, it must be marked for relocation, because the address of the procedure 
may change as a result of the normal linking process. To put it another way, a normal 
linker already moves code up and down, because every module except the first must 
be positioned after all previous modules. Addresses that were correct relative to the 
beginning of the separately compiled module must be relocated when that module is 
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combined with others. 

Addresses can also appear implicitly, in pc-relative operations. These may also 
be marked for relocation. But they might not; a pc-relative jump to a destination in 
the same module can normally be completely resolved at compile time, so link-time 
relocation would be unnecessary. Fortunately, pc-relative references can be recognized 
from their own appearance. Either the instruction is one that always makes an implicit 
pc-relative reference, like some branch instructions, or it explicitly uses the pc as an 
operand. In either case we can compute the effective address by combining the 
address of the instruction with the offset in the instruction. 

Once we have found the places where addresses appear, we must know how to 
convert them into the correct new addresses. This requires us to make two passes over 
the module; one to plan the changes, so that we know how much correction is needed 
at each point, and one to commit the changes and correct the addresses that appear. (It 
is actually a little cleaner to go ahead and commit the changes in the first pass, but we 
must be careful not to lose the information that tells us where the addresses that need 
to be corrected appear.) After the first pass we can build a mapping from old addresses 
into new addresses, and use it to translate addresses in the second pass. 

In constructing this mapping we must remember that when we insert code 
between two instructions we will think of it as being either after the first instruction or 
before the second. This makes a difference. If there is a branch to the second 
instruction, we need to know whether that branch should still jump to that instruction, 
or instead should jump to the inserted code. If we want to go on jumping to the origi- 
nal instruction, we must translate the destination address of the old branch by adding 
the length of the inserted code; if we want to jump to the inserted code then we leave 
the destination address unchanged. 

It is important to remember that the transformation we are making is from one 
object module into another. We are not doing any of the operations one normally 
associates with the linking process; that will come later. All addresses in the old 
module are relative to the beginning of the module, and the same will be true of the 
modified module that we end up with. 

We want to produce an object module in the usual form, just like those the linker 
might read from object files. So we must make sure it has a correct relocation diction- 
ary and a correct loader symbol table. Addresses that appear in these must be 
corrected just as addresses in code or data are. 

The symbol table is easy. Entries in it are tagged with types that tell us whether 
or not they are addresses. We modify them by looking them up in the mapping, just 
as if they had appeared in the code. 

The relocation dictionary is easy, too. Each entry applies to one instruction or 
data item, designated by an address relative to the beginning of the module. We must 
correct the address, and also add or delete relocation entries that apply to instructions 
that are added or deleted. Correcting the addresses in relocation entries is different 
from the other address correction we have done, however. An address here specifies a 
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particular instruction rather than a location in the code. If we insert another instruction 
before a relocated instruction, we do not want the relocation suddenly to apply to the 
new instruction instead! The easiest way to accomplish this is to build a new reloca- 
tion dictionary as we build the new version of its associated code segment, adding new 
relocation entries whenever we add an old or new instruction that requires relocation. 
The new address is then manifest because we just finished putting the instruction there. 

To summarize, then, the general algorithm for modifying an object module is this. 
Based on the transformation requested, the linker determines the changes to be made. 
Some of these changes may be triggered by marks left by the compiler. In one pass it 
plans (and possibly makes) the individual changes. As it does this, it builds a mapping 
from old addresses to new addresses, in each case relative to the beginning of the 
module under consideration. On a second pass it commits the planned changes and 
corrects the addresses that appear in the changed code. During this second pass it also 
produces a new version of the code's relocation dictionary. It then goes on to correct 
the addresses that appear in the data segment and in the loader symbol table. The 
result is a new version of the object module, which can then be linked with others just 
as if it had been read from the object file. 

4. Applications 

We have developed a wide range of applications for our code modification tech- 
nology. Some were for very special purposes, used to investigate a particular question 
and then essentially discarded. Others were for production program development tools 
that we continue to use routinely. 

4.1. Interprocedural register allocation 

The first application was Mahler's interprocedural register allocator. This system 
has been described previously [20] so only an overview will be given here. 

At link time, the register allocator builds a call graph for the entire program and 
uses it to combine scalar variables into "pseudo-registers." Two variables can be 
assigned to the same pseudo-register if they are local to procedures that are never 
simultaneously active, or if they are local to the same procedure and the compiler has 
determined that the two are never simultaneously live. Global variables and constants 
are assigned to singleton pseudo-registers. The pseudo-registers are then sorted in 
order of frequency of use, computed from estimates of variable reference frequency 
produced by the compiler. The most frequently-used pseudo-registers become real 
registers. The advantages of doing this at link-time instead of compile time are that 
globals can be safely included and that locals in different modules can be given the 
same registers or different registers depending on whether they interfere with each 
other. 

Once the register allocator has decided which variables and constants should be 
kept in registers instead of in memory, the linker must rewrite the code to reflect that 
decision. It must remove loads and stores of these variables, and it must change the 
operands of other instructions so that they are the registers allocated to these variables 
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instead of temporary registers generated by the compiler. It must also deal with calls 
that are recursive or indirect through procedure variables, both of which require us to 
insert saves and restores at certain points. 

To make all these changes correctly, the linker relies on very extensive marking 
by the compiler. Essentially every point where the value of a variable is loaded, used, 
computed, or stored is marked so that the linker can recognize it and know which vari- 
able is relevant. This information requires a lot of space; about half the space in a 
typical object module is register allocation marks. However, the pervasiveness of 
these marks made it easy for us to see other applications of the technique, even when 
the information required by the application was much less. 

4.2. Pipeline instruction scheduling 

The Mahler system also includes a pipeline instruction scheduler [21]. The 
scheduler rearranges instructions so that there are fewer places where dependencies 
between instructions cause hardware pipeline stalls. In contrast to the register alloca- 
tion, the pipeline scheduler needs little global information. It simply breaks the 
module into basic blocks and re-orders each basic block independently. (In fact the 
basic blocks are not quite independent, as the scheduler may try to fill a branch slot in 
one block with an instruction from a block executed immediately after the first block.) 
It must also tolerate the presence of certain tricky idioms, such as a variable left shift 
implemented as an indexed jump into a table of constant left shifts, which it would be 
wrong to re-order. These idioms are marked by the compiler so that the scheduler can 
easily recognize them. 

Thus the scheduler makes use of no intermodule information. This means that 
the scheduling could be done in the compiler instead of in the linker. It is done in the 
linker for the simple reason that it works much better if done after register allocation, 
because the transformations done in register allocation make large changes in the 
dependency structure of the basic block. 

This is unfortunate in a production sense, because the scheduling is relatively 
expensive; when invoked, it slows down the linker by a large fraction. However, it 
has also allowed us to experiment with different pipeline structures. We created an 
interface that described a pipeline structure, and attached the interface both to the 
linker and to a fast instruction-level simulator. We could thereby schedule code 
according to the specified pipeline, and then simulate it according to the same pipeline. 
This gave us a more realistic estimate of the consequences of adopting a particular 
pipeline structure than we would have gotten had we merely simulated code produced 
for a different pipeline structure, or no particular pipeline at all. 

The interface assumes that instructions fall into disjoint classes, such as 
add/subtract, logical operations, singleword load, and so on. The classes are hardwired 
into the interface, but are intentionally quite specific. It is quite unlikely that two 
instructions in the same class will ever have substantially different pipeline properties. 
The user can combine the classes into groups called units. If the architecture has a 
functional unit that does add/subtract and also logical operations, the user would group 
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these into the same unit. The user can then specify five properties of the pipeline. 
First, each class has a result latency, which is the span between the time an operation 
begins and the time its result can be used in a later instruction, including bypass. 
Next, each unit has an issue latency, which is based on how intensively the unit is 
pipelined; it is the span between the time this unit issues one operation and the time 
when it can issue a second operation. This may be independent of the result latencies 
of these operations. Third, each unit has a multiplicity. There may be several 
instances of a given functional unit, so as to exploit parallelism beyond that provided 
by the degree of pipelining. Some pipelines impose an address latency because a 
memory reference needs its address operand earlier in the pipeline than other opera- 
tions. An address latency means essentially that an operation has a slightly longer 
result latency if its result is being used as the address in a memory reference. Finally, 
an architecture may allow several consecutive instructions to be issued in the same 
cycle, if they are mutually independent. This multiple issue is limited by the availabil- 
ity of the necessary functional units, and also by the maximum multiple issue. This is a 
limit on the maximum number of instructions that can issue in one cycle, regardless of 
whether more instructions and functional units are available. 

The interface has been extended to specify other machine properties as well, 
including the number of registers and the detailed structure of its caches and cache 
buffers. We have used it for a variety of architectural studies [9,10,11,12,13] as well 
as to estimate the performances of machines proposed both within DECWRL and by 
product groups elsewhere in DEC. 

4.3. Instruction-level instrumentation 

When the pipeline scheduler was finished, the question naturally arose of how 
well it did. Code modification provided a convenient way to answer this question, and 
was DECWRL' s first use of this technique for performance analysis rather than perfor- 
mance enhancement. 

The idea behind the instruction-level instrumentation is that we create a set of 
global counters in the instrumented program, one for each event of interest. For exam- 
ple, we may be interested in the number of load instructions that are executed. Then 
we analyze each basic block, and insert code in it to increment all the counters associ- 
ated with events that occur in that block. If we execute the resulting instrumented pro- 
gram, and then output the counters, we get dynamic counts of all the events. 

It is easy to use this technique to count any kind of event that is apparent from a 
static analysis of the basic block. For instance, we can count the executed loads or 
stores or coprocessor instructions. We can also count events like the execution of 
delayed branches whose branch slots were not filled. We can even count the execu- 
tions of various multi-instruction idioms such as byte load or store, or procedure call 
or return. 

To a first approximation, we can also count pipeline stalls with this method. We 
examine the basic block for sequences that will cause a pipeline stall, such as (on the 
Titan) a store immediately followed by a load. We insert code to increment the 
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"store-load-stall" counter by the number that we find in this block. This analysis is 
the same analysis that the scheduler itself must do. 

Counting stalls in this way is only approximate, however, because a stall can hap- 
pen at the boundary between two basic blocks. To make the counts precise, we create 
some auxiliary globals in the instrumented program, to record certain state information 
about the currently executing block. We check each block for instructions that might 
cause a stall, depending on what happens in the next block executed. When we find 
such instructions, we insert code to record the fact. For instance, if a basic block ends 
with a store instruction, we insert code that sets the "ends-with-store" bit. If a basic 
block begins with a load, we insert code that looks at the "ends-with-store" bit and 
increments the "store-load-stall" counter if it is set (and clears the bit in any case). 
This lets us exactly count events that depend on inter-block information. 

Instruction-level instrumentation provides us with statistics that one normally 
acquires by instruction-level simulation. Inserting the instrumentation at link-time is 
expensive, but it is an order of magnitude cheaper than inserting it at compile-time 
would be. Executing the instrumented program is expensive, too; the instrumentation 
can slow the program by an order of magnitude. But simulating the program instead 
would be slower by two to four orders of magnitude. For this kind of application, 
link-time instrumentation seems to be the best approach by far. 

4.4. Procedure-level profiling 

The example of gprof was mentioned earlier in this paper. This profiler usually 
requires the recompilation of all the source modules in the program, with the option 
-pg. However, the only effect of this option is to insert, at the beginning of each pro- 
cedure, a call to a special routine named mcount. This routine is responsible for keep- 
ing track of the calls that occur. In our approach, the compiler always marks the place 
where the call to mcount would be inserted. If the user tells the linker to link for 
gprof, then the linker inserts a call to mcount at each such point. 

The advantages are obvious. We can use gprof without recompiling from 
scratch. We do not need to maintain a separate instrumented version of each library. 
The world is now easier and faster for both the user and the system administrator. 

Others at DECWRL are experimenting with some extensions to this approach. 
Normally, gprof obtains execution times by "pc-sampling." The profiled program is 
run in parallel with another process that periodically looks at the pc of the profiled 
program, to determine which procedure is executing. This approach has a few disad- 
vantages. First, it means that assignments of expense are only statistical; if A calls P 
90 times and B calls P 10 times, then gprof can only assume that 90% of the time 
spent in P was for the benefit of A rather than B. This may be just plain wrong; A 
may call P only for easy cases. (Maybe that's why it calls so much.) Second, it may 
make it difficult to profile the kernel of the operating system. Some machines may not 
allow the kernel to be interrupted. Even if the kernel can be interrupted, the timer 
interrupts themselves trigger certain events in the kernel, which means that pc- 
sampling based on the same timer interrupts may cause a skewed picture of where the 
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kernel spends its time. Third, call overhead itself may account for a significant frac- 
tion of a program's execution time, and the pc-sampling technique charges the caller 
and callee for the amount of overhead that occurs in each. Different calling conven- 
tions might result in a much different profile. 

The Titan has a time-of-day clock that is visible to the user. We are experiment- 
ing with an alternative gprof that depends on the linker's ability to insert code noting 
the current time at key points. This will allow us to assign charges based on the 
actual current context. It will allow us to dispense with a parallel pc-sampling process, 
making kernel-profiling easier. And it will allow us to determine the expense of each 
call, charging whichever routine we wish, or reporting the time in some new way. 

4.5. Block-level profiling 

Just as we can insert code to keep track of the number of times we execute each 
procedure, we can insert code to count the executions of each individual basic block. 
Basic blocks are shorter than procedures, so we want to minimize the overhead of this 
counting. We therefore insert the counting code in-line rather than insert a call to a 
counting procedure. The linker allocates a long vector of counts, one for each basic 
block in the program, and the inserted code for a block increments the count associ- 
ated with that block. 

There are many applications of this information, most of which we have not yet 
explored. Our main use of the basic block execution counts is the construction of a 
variable-reference profile. Our intermodule register allocator normally selects the 
important variables by looking at estimates of variable reference frequency that are 
produced by the compiler. Often these estimates are not very good, and the allocation 
can be improved by using actual reference counts instead of estimates. A variable- 
reference profile tells us how many times each variable or constant was referenced dur- 
ing the program execution; if we feed it to the register allocator and relink, we get a 
better allocation and a faster executable. To build the variable-reference profile, we 
combine the basic block counts with static information that can be produced by the 
register allocator. This static information tells, for each basic block, how many vari- 
able references it contains, and which variables they are. This is easy for the register 
allocator to determine, because the references it counts are marked for removal in the 
event that the variable loaded or stored is assigned to a register. 

We hope to use procedure profiles and basic block profiles more extensively, to 
guide other aspects of the optimization process. If the optimizer knows where the pro- 
gram spends its time, it should be able to spend more effort optimizing those parts. 

4.6. Register management strategies 

In 1988, we reported the results of a study comparing register allocation tech- 
niques and hardware register windows [22]. A modern processor probably has a lot of 
registers, and exploits them by keeping variables in them. Different techniques exist 
for accomplishing this. Some machines [8,9,17,19] treat the register set as a flat space 
and rely on the language system to allocate the registers to variables. Other machines 
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[2,3,18] include hardware to divide the register set into circular buffer of windows. 
When a procedure is called, the tail of the buffer is advanced in order to allocate a 
new window of registers, which the procedure can use for its locals. 

Both techniques allow us to dispense with many loads and stores that would oth- 
erwise be needed. Both techniques also require us to include new loads and stores that 
would not otherwise be needed. In the case of register windows, the new loads and 
stores are needed when we run out of buffers after a long series of calls or returns. In 
the case of register allocation, the new loads and stores are needed when we make pro- 
cedure calls that would otherwise harm registers in use. In a naive allocation this 
might be every procedure call; our more sophisticated allocation inserts saves and 
restores only for recursive or indirect calls. 

Thus we can imagine evaluating a hardware or software register management 
technique, by counting the loads and stores saved and then subtracting the loads and 
stores inserted. Our study did this via link-time code modification. Register allocation 
techniques were measured by inserting code to count the different kinds of loads and 
stores. Register window techniques were measured by inserting code to maintain a 
simple data structure describing what the current state of the window buffer would be 
on a window machine with specified parameters. The results showed that hardware 
windows did not have a significant advantage over register allocation. 

4.7. Address traces 

An address trace is the sequence of instruction and/or data addresses referenced 
by a program during the course of its execution. Such a trace is useful in simulating 
the performance of different cache configurations. Previous techniques for acquiring 
address traces fall into two general categories. One approach involves adding micro- 
code to the machine [1]. The microcode watches the address bus and logs those it 
sees. This approach slows execution by an order of magnitude, and is not suitable for 
machines without microcode, like most modern RISC processors. The other approach 
is simulation. We can build an instruction-level machine simulator that also logs the 
memory references made by the simulated program. Unfortunately, simulation is very 
slow; simulating a program typically takes two to four orders of magnitude longer than 
executing it. With the advent of machines with very large caches, we must use very 
long address traces to get realistic and useful results. Simulation is too slow to do this 
easily. 

Our approach was to use link- time code modification to instrument the code. 
Wherever a data memory reference appears, the linker inserts a very short, stylized call 
to a routine that logs the reference in a large buffer. The same thing is done at the 
beginning of each basic block, to record instruction references. When the buffer fills 
up, something is done to process it; either it is dumped to an output device, or a cache 
simulation is resumed to process it. 

It is important that we not consider single programs in isolation, because the 
cache is shared by all user processes and also by the system kernel. Others at 
DECWRL [4] are extending this technique by making the system kernel allocate a 
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shared buffer that everybody can use; this allows the traces of kernel and user activity 
to be correctly interleaved. Some interesting work in the operating system is neces- 
sary, to synchronize things properly and to correct for the effect of the execution slow- 
down on the behavior of of time-triggered events in the kernel. 

5. Conclusions 

Link-time code modification is easy to do. The format of object modules and the 
structure of the linker provide a convenient interface for applying various transforma- 
tions on each module being linked. Where necessary, the compiler can routinely pro- 
vide hints that are ignored unless a relevant transformation is requested. 

Link-time code modification is flexible. We have used it for a wide variety of 
one-shot and production applications. We have used it as a vehicle for optimization 
techniques that can make impressive differences in the speed of programs. It has 
enabled performance instrumentation from the level of individual instructions up to the 
level of user procedures; almost by accident it has changed the standard tool gprof 
from something one might use a few times a year into something one might use a few 
times a month or week. We expect to be able to add to its uses for some time to 
come. 
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